Weather data exploration

HAN Duqing

INTRODUCTION:

Weather matters to everyone. Every time we go out, we ask questions about the weather: What's the temperature outside? Is it going to rain? That is why we chose to do a data exploration on weather.

Historical Hourly Weather Data:

The dataset contains 5 years of hourly measurements of various weather attributes, such as temperature, humidity, and pressure (more than 40,000 rows). The data covers 30 US and Canadian cities, as well as 6 Israeli cities.

URL: http://www.kaggle.com/selfishgene/historical-hourly-weather-data

1. LOAD DATA

First of all, connect to Hive.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from pyhive import hive
conn = hive.connect("quickstart.cloudera", username="cloudera")
cursor = conn.cursor()

cursor.execute('DROP DATABASE IF EXISTS weather CASCADE') # if one was created before
cursor.execute('CREATE DATABASE weather')

Create the table 'humidity' and load data into the table.

In [2]:
cursor.execute("CREATE TABLE IF NOT EXISTS humidity (\
datetime STRING,\
Vancouver INT,Portland INT,SanFrancisco INT,Seattle INT,\
LosAngeles  INT,SanDiego INT,LasVegas INT,Phoenix INT,\
Albuquerque INT,Denver INT,SanAntonio INT,Dallas INT,\
Houston INT,KansasCity INT,Minneapolis INT,SaintLouis INT,\
Chicago INT,Nashville INT,Indianapolis INT,Atlanta INT,\
Detroit INT,Jacksonville INT,Charlotte INT,Miami INT,\
Pittsburgh INT,Toronto INT,Philadelphia INT,NewYork INT,\
Montreal INT,Boston INT,Beersheba INT,TelAvivDistrict INT,\
Eilat INT,Haifa INT,Nahariyya INT,Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/humidity.csv' OVERWRITE INTO TABLE humidity")


cursor.execute('ALTER TABLE humidity set \
TBLPROPERTIES ("skip.header.line.count"="1")') # skip the header line
cursor.execute('SET hive.cli.print.header=true') # print the column names
cursor.execute('SELECT * FROM humidity')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
Out[2]:
datetime Vancouver Portland SanFrancisco Seattle LosAngeles SanDiego LasVegas Phoenix Albuquerque ... Philadelphia NewYork Montreal Boston Beersheba TelAvivDistrict Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN
1 2012-10-01 13:00:00 76.0 81.0 88.0 81.0 88.0 82.0 22.0 23.0 50.0 ... 71.0 58.0 93.0 68.0 50.0 63.0 22.0 51.0 51.0 50.0
2 2012-10-01 14:00:00 76.0 80.0 87.0 80.0 88.0 81.0 21.0 23.0 49.0 ... 70.0 57.0 91.0 68.0 51.0 62.0 22.0 51.0 51.0 50.0
3 2012-10-01 15:00:00 76.0 80.0 86.0 80.0 88.0 81.0 21.0 23.0 49.0 ... 70.0 57.0 87.0 68.0 51.0 62.0 22.0 51.0 51.0 50.0
4 2012-10-01 16:00:00 77.0 80.0 85.0 79.0 88.0 81.0 21.0 23.0 49.0 ... 69.0 57.0 84.0 68.0 52.0 62.0 22.0 51.0 51.0 50.0

5 rows × 37 columns
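The same fetchall-to-DataFrame pattern is repeated for every table. A minimal, self-contained sketch of that conversion (with made-up rows and only two city columns) looks like this:

```python
import pandas as pd

# Toy rows standing in for cursor.fetchall(): a list of tuples where
# missing measurements come back as None (values are hypothetical).
query = [
    ("2012-10-01 12:00:00", None, None),
    ("2012-10-01 13:00:00", 76, 81),
]
rows = pd.DataFrame(query, columns=["datetime", "Vancouver", "Portland"])

# Coerce the city columns to numeric; None becomes NaN.
for col in ["Vancouver", "Portland"]:
    rows[col] = pd.to_numeric(rows[col], errors="coerce")
```

The `errors="coerce"` option turns anything non-numeric into NaN, which is why the first row of each table shows NaN values.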

Create the table 'pressure' and load data into the table.

In [3]:
cursor.execute("CREATE TABLE IF NOT EXISTS pressure (\
datetime STRING,\
Vancouver INT,\
Portland INT,\
SanFrancisco INT,\
Seattle INT,\
LosAngeles  INT,\
SanDiego INT,\
LasVegas INT,\
Phoenix INT,\
Albuquerque INT,\
Denver INT,\
SanAntonio INT,\
Dallas INT,\
Houston INT,\
KansasCity INT,\
Minneapolis INT,\
SaintLouis INT,\
Chicago INT,\
Nashville INT,\
Indianapolis INT,\
Atlanta INT,\
Detroit INT,\
Jacksonville INT,\
Charlotte INT,\
Miami INT,\
Pittsburgh INT,\
Toronto INT,\
Philadelphia INT,\
NewYork INT,\
Montreal INT,\
Boston INT,\
Beersheba INT,\
TelAvivDistrict INT,\
Eilat INT,\
Haifa INT,\
Nahariyya INT,\
Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")

cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/pressure.csv' OVERWRITE INTO TABLE pressure")
cursor.execute('ALTER TABLE pressure set \
TBLPROPERTIES ("skip.header.line.count"="1")') # skip the header line
cursor.execute('SET hive.cli.print.header=true') # print the column names
cursor.execute('SELECT * FROM pressure')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
Out[3]:
datetime Vancouver Portland SanFrancisco Seattle LosAngeles SanDiego LasVegas Phoenix Albuquerque ... Philadelphia NewYork Montreal Boston Beersheba TelAvivDistrict Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 1011.0 NaN NaN NaN
1 2012-10-01 13:00:00 NaN 1024.0 1009.0 1027.0 1013.0 1013.0 1018.0 1013.0 1024.0 ... 1014.0 1012.0 1001.0 1014.0 984.0 1012.0 1010.0 1013.0 1013.0 990.0
2 2012-10-01 14:00:00 NaN 1024.0 1009.0 1027.0 1013.0 1013.0 1018.0 1013.0 1024.0 ... 1014.0 1012.0 986.0 1014.0 984.0 1012.0 1010.0 1013.0 1013.0 990.0
3 2012-10-01 15:00:00 NaN 1024.0 1009.0 1028.0 1013.0 1013.0 1018.0 1013.0 1024.0 ... 1014.0 1012.0 945.0 1014.0 984.0 1012.0 1010.0 1013.0 1013.0 990.0
4 2012-10-01 16:00:00 NaN 1024.0 1009.0 1028.0 1013.0 1013.0 1018.0 1013.0 1024.0 ... 1014.0 1012.0 904.0 1014.0 984.0 1012.0 1010.0 1013.0 1013.0 990.0

5 rows × 37 columns

Create the table 'temperature' and load data into the table.

In [4]:
cursor.execute("CREATE TABLE IF NOT EXISTS temperature (\
datetime STRING,\
Vancouver FLOAT,\
Portland FLOAT,\
SanFrancisco FLOAT,\
Seattle FLOAT,\
LosAngeles  FLOAT,\
SanDiego FLOAT,\
LasVegas FLOAT,\
Phoenix FLOAT,\
Albuquerque FLOAT,\
Denver FLOAT,\
SanAntonio FLOAT,\
Dallas FLOAT,\
Houston FLOAT,\
KansasCity FLOAT,\
Minneapolis FLOAT,\
SaintLouis FLOAT,\
Chicago FLOAT,\
Nashville FLOAT,\
Indianapolis FLOAT,\
Atlanta FLOAT,\
Detroit FLOAT,\
Jacksonville FLOAT,\
Charlotte FLOAT,\
Miami FLOAT,\
Pittsburgh FLOAT,\
Toronto FLOAT,\
Philadelphia FLOAT,\
NewYork FLOAT,\
Montreal FLOAT,\
Boston FLOAT,\
Beersheba FLOAT,\
TelAvivDistrict FLOAT,\
Eilat FLOAT,\
Haifa FLOAT,\
Nahariyya FLOAT,\
Jerusalem FLOAT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")

cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/temperature.csv' OVERWRITE INTO TABLE temperature")
cursor.execute('ALTER TABLE temperature set \
TBLPROPERTIES ("skip.header.line.count"="1")') # skip the header line
cursor.execute('SET hive.cli.print.header=true') # print the column names
cursor.execute('SELECT * FROM temperature')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
Out[4]:
datetime Vancouver Portland SanFrancisco Seattle LosAngeles SanDiego LasVegas Phoenix Albuquerque ... Philadelphia NewYork Montreal Boston Beersheba TelAvivDistrict Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 309.100006 NaN NaN NaN
1 2012-10-01 13:00:00 284.630005 282.079987 289.480011 281.799988 291.869995 291.529999 293.410004 296.600006 285.119995 ... 285.630005 288.220001 285.829987 287.170013 307.589996 305.470001 310.579987 304.399994 304.399994 303.5
2 2012-10-01 14:00:00 284.629028 282.083252 289.475006 281.797211 291.868195 291.533508 293.403137 296.608521 285.154572 ... 285.663208 288.247681 285.834656 287.186096 307.589996 304.309998 310.495758 304.399994 304.399994 303.5
3 2012-10-01 15:00:00 284.626984 282.091858 289.460632 281.789825 291.862854 291.543365 293.392181 296.631500 285.233948 ... 285.756836 288.326935 285.847778 287.231659 307.391510 304.281830 310.411530 304.399994 304.399994 303.5
4 2012-10-01 16:00:00 284.624969 282.100494 289.446259 281.782440 291.857513 291.553223 293.381226 296.654480 285.313354 ... 285.850433 288.406189 285.860931 287.277252 307.145203 304.238007 310.327301 304.399994 304.399994 303.5

5 rows × 37 columns

Create the table 'description' and load data into the table.

In [5]:
cursor.execute("CREATE TABLE IF NOT EXISTS description (\
datetime STRING,\
Vancouver STRING,\
Portland STRING,\
SanFrancisco STRING,\
Seattle STRING,\
LosAngeles  STRING,\
SanDiego STRING,\
LasVegas STRING,\
Phoenix STRING,\
Albuquerque STRING,\
Denver STRING,\
SanAntonio STRING,\
Dallas STRING,\
Houston STRING,\
KansasCity STRING,\
Minneapolis STRING,\
SaintLouis STRING,\
Chicago STRING,\
Nashville STRING,\
Indianapolis STRING,\
Atlanta STRING,\
Detroit STRING,\
Jacksonville STRING,\
Charlotte STRING,\
Miami STRING,\
Pittsburgh STRING,\
Toronto STRING,\
Philadelphia STRING,\
NewYork STRING,\
Montreal STRING,\
Boston STRING,\
Beersheba STRING,\
TelAvivDistrict STRING,\
Eilat STRING,\
Haifa STRING,\
Nahariyya STRING,\
Jerusalem STRING)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")

cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/weather_description.csv' OVERWRITE INTO TABLE description")
cursor.execute('ALTER TABLE description set \
TBLPROPERTIES ("skip.header.line.count"="1")') # skip the header line
cursor.execute('SET hive.cli.print.header=true') # print the column names
cursor.execute('SELECT * FROM description')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
Out[5]:
datetime Vancouver Portland SanFrancisco Seattle LosAngeles SanDiego LasVegas Phoenix Albuquerque ... Philadelphia NewYork Montreal Boston Beersheba TelAvivDistrict Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 ... haze
1 2012-10-01 13:00:00 mist scattered clouds light rain sky is clear mist sky is clear sky is clear sky is clear sky is clear ... broken clouds few clouds overcast clouds sky is clear sky is clear sky is clear haze sky is clear sky is clear sky is clear
2 2012-10-01 14:00:00 broken clouds scattered clouds sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear ... broken clouds few clouds sky is clear few clouds sky is clear sky is clear broken clouds overcast clouds sky is clear overcast clouds
3 2012-10-01 15:00:00 broken clouds scattered clouds sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear ... broken clouds few clouds sky is clear few clouds overcast clouds sky is clear broken clouds overcast clouds overcast clouds overcast clouds
4 2012-10-01 16:00:00 broken clouds scattered clouds sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear sky is clear ... broken clouds few clouds sky is clear few clouds overcast clouds sky is clear broken clouds overcast clouds overcast clouds overcast clouds

5 rows × 37 columns

Create the table 'windspeed' and load data into the table.

In [6]:
cursor.execute("CREATE TABLE IF NOT EXISTS windspeed (\
datetime STRING,\
Vancouver INT,\
Portland INT,\
SanFrancisco INT,\
Seattle INT,\
LosAngeles  INT,\
SanDiego INT,\
LasVegas INT,\
Phoenix INT,\
Albuquerque INT,\
Denver INT,\
SanAntonio INT,\
Dallas INT,\
Houston INT,\
KansasCity INT,\
Minneapolis INT,\
SaintLouis INT,\
Chicago INT,\
Nashville INT,\
Indianapolis INT,\
Atlanta INT,\
Detroit INT,\
Jacksonville INT,\
Charlotte INT,\
Miami INT,\
Pittsburgh INT,\
Toronto INT,\
Philadelphia INT,\
NewYork INT,\
Montreal INT,\
Boston INT,\
Beersheba INT,\
TelAvivDistrict INT,\
Eilat INT,\
Haifa INT,\
Nahariyya INT,\
Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/wind_speed.csv' OVERWRITE INTO TABLE windspeed")
cursor.execute('ALTER TABLE windspeed set \
TBLPROPERTIES ("skip.header.line.count"="1")') # skip the header line
cursor.execute('SET hive.cli.print.header=true') # print the column names
cursor.execute('SELECT * FROM windspeed')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
Out[6]:
datetime Vancouver Portland SanFrancisco Seattle LosAngeles SanDiego LasVegas Phoenix Albuquerque ... Philadelphia NewYork Montreal Boston Beersheba TelAvivDistrict Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 8.0 NaN NaN NaN
1 2012-10-01 13:00:00 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 4.0 ... 4.0 7.0 4.0 3.0 1.0 0.0 8.0 2.0 2.0 2.0
2 2012-10-01 14:00:00 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 4.0 ... 4.0 7.0 4.0 3.0 3.0 0.0 8.0 2.0 2.0 2.0
3 2012-10-01 15:00:00 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 4.0 ... 3.0 7.0 4.0 3.0 3.0 0.0 8.0 2.0 2.0 2.0
4 2012-10-01 16:00:00 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 4.0 ... 3.0 7.0 4.0 3.0 3.0 0.0 8.0 2.0 2.0 2.0

5 rows × 37 columns

2. BOSTON

2.1 We merge two tables into one table 'Boston' using a JOIN on datetime.

In [7]:
cursor.execute('DROP TABLE IF EXISTS Boston') # if one was created before
cursor.execute("CREATE TABLE IF NOT EXISTS Boston (\
datetime STRING,humidity INT,pressure INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")

cursor.execute("INSERT INTO TABLE boston SELECT a.datetime, a.Boston, b.Boston FROM humidity a JOIN pressure b ON (a.datetime=b.datetime)")
cursor.execute("select * from boston")
pd.DataFrame(cursor.fetchall())
Out[7]:
0 1 2
0 2012-10-01 12:00:00 NaN NaN
1 2012-10-01 13:00:00 68.0 1014.0
2 2012-10-01 14:00:00 68.0 1014.0
3 2012-10-01 15:00:00 68.0 1014.0
4 2012-10-01 16:00:00 68.0 1014.0
... ... ... ...
45248 2017-11-29 20:00:00 37.0 1017.0
45249 2017-11-29 21:00:00 74.0 1019.0
45250 2017-11-29 22:00:00 74.0 1019.0
45251 2017-11-29 23:00:00 56.0 1022.0
45252 2017-11-30 00:00:00 56.0 1023.0

45253 rows × 3 columns
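The Hive JOIN above pairs rows of the two tables that share a datetime key. The equivalent operation in pandas (sketched here with two tiny hypothetical per-feature frames) is `merge`:

```python
import pandas as pd

# Hypothetical per-feature tables: each has a datetime key and a Boston
# column, mirroring the Hive tables above.
humidity = pd.DataFrame({"datetime": ["2012-10-01 13:00:00", "2012-10-01 14:00:00"],
                         "Boston": [68, 68]})
pressure = pd.DataFrame({"datetime": ["2012-10-01 13:00:00", "2012-10-01 14:00:00"],
                         "Boston": [1014, 1014]})

# Inner join on the shared datetime key, like the Hive JOIN.
boston = humidity.merge(pressure, on="datetime", suffixes=("_hum", "_pres"))
```

Since both frames have a column named Boston, the `suffixes` argument disambiguates them in the result.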

We merge the other tables in the same way to get a new table 'Boston' with all the weather features.

In [8]:
cursor.execute("ALTER TABLE Boston ADD COLUMNS (temperature double)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,b.Boston FROM boston a JOIN temperature b ON (a.datetime=b.datetime)")
cursor.execute("ALTER TABLE Boston ADD COLUMNS (windspeed int)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,a.temperature,b.Boston FROM boston a JOIN windspeed b ON (a.datetime=b.datetime)")
cursor.execute("ALTER TABLE Boston ADD COLUMNS (description string)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,a.temperature,a.windspeed,b.Boston FROM boston a JOIN description b ON (a.datetime=b.datetime)")
cursor.execute("select * from boston")
pd.DataFrame(cursor.fetchall())
Out[8]:
0 1 2 3 4 5
0 2012-10-01 12:00:00 NaN NaN NaN NaN
1 2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 sky is clear
2 2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 few clouds
3 2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 few clouds
4 2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 few clouds
... ... ... ... ... ... ...
45248 2017-11-29 20:00:00 37.0 1017.0 288.079987 8.0 broken clouds
45249 2017-11-29 21:00:00 74.0 1019.0 286.019989 6.0 broken clouds
45250 2017-11-29 22:00:00 74.0 1019.0 283.940002 7.0 broken clouds
45251 2017-11-29 23:00:00 56.0 1022.0 282.170013 2.0 few clouds
45252 2017-11-30 00:00:00 56.0 1023.0 280.649994 2.0 broken clouds

45253 rows × 6 columns

In [9]:
cursor.execute("select * from boston")
boston=pd.DataFrame(cursor.fetchall())
boston.columns=['datetime','humidity','pressure','temperature','windspeed','description']
boston
Out[9]:
datetime humidity pressure temperature windspeed description
0 2012-10-01 12:00:00 NaN NaN NaN NaN
1 2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 sky is clear
2 2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 few clouds
3 2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 few clouds
4 2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 few clouds
... ... ... ... ... ... ...
45248 2017-11-29 20:00:00 37.0 1017.0 288.079987 8.0 broken clouds
45249 2017-11-29 21:00:00 74.0 1019.0 286.019989 6.0 broken clouds
45250 2017-11-29 22:00:00 74.0 1019.0 283.940002 7.0 broken clouds
45251 2017-11-29 23:00:00 56.0 1022.0 282.170013 2.0 few clouds
45252 2017-11-30 00:00:00 56.0 1023.0 280.649994 2.0 broken clouds

45253 rows × 6 columns

2.2 Use GROUP BY to group identical weather descriptions. We find that the most common description is 'sky is clear'.

In [10]:
cursor.execute('SELECT description,COUNT(description)  FROM Boston GROUP BY description')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['description', 'Count description']))
rows.sort_values(by="Count description",inplace=True,ascending=False)
plt.figure(figsize=(18,9))
plt.bar(rows["description"],rows["Count description"])
plt.xticks(rotation=70,fontsize=12)
print('Distribution of weather by description')
Distribution of weather by description

'sky is clear' is the most frequent weather condition in Boston, followed by 'light rain' and 'mist'.
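In pandas, the same GROUP BY count is a one-liner with `value_counts`, which also sorts by frequency. A toy sketch with made-up descriptions:

```python
import pandas as pd

# Toy description column (made-up values) to illustrate the GROUP BY count.
desc = pd.Series(["sky is clear", "sky is clear", "light rain", "mist"])
counts = desc.value_counts()  # sorted by frequency, descending
```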

Use a pie chart to see the proportions.

In [11]:
category_pie = boston['description'].value_counts().sort_values(ascending=True)
category_pie.plot.pie(figsize=(10, 10),autopct='%.2f')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b82384590>


2.3 Use GROUP BY to see the average humidity, temperature, and pressure for each weather description.

In [12]:
cursor.execute("select description,AVG(humidity) from Boston GROUP BY description")
query=cursor.fetchall()
df1=pd.DataFrame(query)
df1.sort_values(1,inplace=True)
df1.columns=['weather_description','AVG_humidity']
df1.plot(x='weather_description',y='AVG_humidity',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b7c61b650>

From this bar chart we can see that rain, snow, and fog weather normally have high humidity.
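The AVG(...) GROUP BY queries above correspond to a pandas `groupby(...).mean()`. A minimal sketch with invented values:

```python
import pandas as pd

# Toy data: average humidity per weather description, as in the Hive query.
df = pd.DataFrame({"description": ["rain", "rain", "clear"],
                   "humidity": [90, 80, 40]})
avg = df.groupby("description")["humidity"].mean()
```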

In [13]:
cursor.execute("select description,AVG(pressure) from Boston GROUP BY description")
query=cursor.fetchall()
df2=pd.DataFrame(query)
df2.sort_values(1,inplace=True)
df2.columns=['weather_description','AVG_pressure']
df2.plot(x='weather_description',y='AVG_pressure',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b828326d0>
In [14]:
cursor.execute("select description,AVG(temperature) from Boston GROUP BY description")
query=cursor.fetchall()
df3=pd.DataFrame(query)
df3.sort_values(1,inplace=True)
df3.columns=['weather_description','AVG_temperature']
df3.plot(x='weather_description',y='AVG_temperature',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b7f329b10>

This chart is interesting: we can see three separate stages. The lowest averages belong to snow weather, which occurs in winter, so the average temperature is naturally low. The highest belong to thunderstorm weather, which usually happens in hot summers. The bars in the middle have no clear seasonal preference, so they sit around the overall average.

In [15]:
d1=df1.copy()
d2=df2.copy()
d3=df3.copy()
d1["AVG_humidity"]=(d1["AVG_humidity"]-d1["AVG_humidity"].min())/(d1["AVG_humidity"].max()-d1["AVG_humidity"].min())
d2["AVG_pressure"]=(d2["AVG_pressure"]-d2["AVG_pressure"].min())/(d2["AVG_pressure"].max()-d2["AVG_pressure"].min())
d3["AVG_temperature"]=(d3["AVG_temperature"]-d3["AVG_temperature"].min())/(d3["AVG_temperature"].max()-d3["AVG_temperature"].min())
In [16]:
d1=d1.merge(d2,on="weather_description")
d1=d1.merge(d3,on="weather_description")
In [17]:
d1.plot(x='weather_description',y=['AVG_humidity','AVG_pressure','AVG_temperature'],
        kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=False)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b82b6f590>

Clean the data: drop rows that contain missing values.

In [18]:
#clean data
bst=boston.copy()
bst.dropna(axis=0,how='any',inplace=True)          # delete rows including empty elements

bst['datetime']=pd.to_datetime(bst['datetime'])
bst=bst.set_index('datetime')
bst.head(20)
Out[18]:
humidity pressure temperature windspeed description
datetime
2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 sky is clear
2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 few clouds
2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 few clouds
2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 few clouds
2012-10-01 17:00:00 68.0 1014.0 287.322845 3.0 few clouds
2012-10-01 18:00:00 68.0 1014.0 287.368408 3.0 few clouds
2012-10-01 19:00:00 68.0 1014.0 287.414001 3.0 few clouds
2012-10-01 20:00:00 68.0 1014.0 287.459564 3.0 few clouds
2012-10-01 21:00:00 68.0 1014.0 287.505157 3.0 few clouds
2012-10-01 22:00:00 68.0 1014.0 287.550720 3.0 few clouds
2012-10-01 23:00:00 68.0 1014.0 287.596313 3.0 few clouds
2012-10-02 00:00:00 68.0 1014.0 287.641876 3.0 few clouds
2012-10-02 01:00:00 68.0 1014.0 287.687469 3.0 few clouds
2012-10-02 02:00:00 68.0 1014.0 287.733032 3.0 few clouds
2012-10-02 03:00:00 68.0 1014.0 287.778625 3.0 few clouds
2012-10-02 04:00:00 68.0 1014.0 287.824219 3.0 few clouds
2012-10-02 05:00:00 68.0 1014.0 287.869781 3.0 few clouds
2012-10-02 06:00:00 68.0 1014.0 287.915375 3.0 few clouds
2012-10-02 07:00:00 68.0 1014.0 287.960938 3.0 few clouds
2012-10-02 08:00:00 68.0 1014.0 288.006531 3.0 few clouds
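The cleaning step above combines three operations: drop rows with missing values, parse the datetime strings, and promote datetime to the index. A self-contained sketch on a toy two-row frame:

```python
import pandas as pd

# Toy frame with one missing row, mirroring the cleaning step above.
df = pd.DataFrame({"datetime": ["2012-10-01 13:00:00", "2012-10-01 14:00:00"],
                   "humidity": [68.0, None]})
df = df.dropna()                                # drop rows with missing values
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index("datetime")                   # datetime becomes the index
```

With a DatetimeIndex in place, year/month grouping and resampling (used below) become available.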

2.4 Show how Boston's weather features change over time.

In [19]:
bst['humidity'] = pd.to_numeric(bst['humidity'])
monthly_humidity=bst['humidity'].groupby([bst.index.year,bst.index.month]).mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),monthly_humidity)
plt.title("Humidity")
Out[19]:
Text(0.5, 1.0, 'Humidity')
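Grouping a datetime-indexed series by `[index.year, index.month]` yields one mean per calendar month, which is what the line plots here show. A toy sketch:

```python
import pandas as pd

# Toy datetime-indexed series; grouping by (year, month) gives monthly means.
idx = pd.to_datetime(["2015-01-01", "2015-01-15", "2015-02-01"])
s = pd.Series([80.0, 60.0, 50.0], index=idx)
monthly = s.groupby([s.index.year, s.index.month]).mean()
```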

There is a distinct peak in 2015. As we just learned from grouping humidity by weather description, high humidity means a high probability of rain. Let's go further.

In [20]:
cursor.execute("select * from Boston where description='light rain' or description='moderate rain' ")
rain=pd.DataFrame(cursor.fetchall())
rain.columns=["datetime","humidity","pressure","temperature","windspeed","description"]
rain.dropna(axis=0,how='any',inplace=True)
rain['datetime']=pd.to_datetime(rain['datetime'])
rain=rain.set_index('datetime')
pd.DataFrame(rain.groupby(rain.index.year).count()["description"])
Out[20]:
description
datetime
2012 203
2013 904
2014 1966
2015 1580
2016 1056
2017 782

We can see that 2014 and 2015 have the most rain records.

In [21]:
bst['pressure'] = pd.to_numeric(bst['pressure'])
monthly_pressure=bst['pressure'].groupby([bst.index.year,bst.index.month]).mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),monthly_pressure)
plt.title("Pressure")
Out[21]:
Text(0.5, 1.0, 'Pressure')
In [22]:
cursor.execute("select * from Boston where description='snow' or description='heavy snow' ")
snow=pd.DataFrame(cursor.fetchall())
snow.columns=["datetime","humidity","pressure","temperature","windspeed","description"]
snow.dropna(axis=0,how='any',inplace=True)
snow['datetime']=pd.to_datetime(snow['datetime'])
snow=snow.set_index('datetime')
pd.DataFrame(snow.groupby(snow.index.year).count()["description"])
Out[22]:
description
datetime
2012 29
2013 85
2014 22
2015 237
2016 7
2017 2
In [23]:
bst['windspeed'] = pd.to_numeric(bst['windspeed'])
monthly_windspeed=bst['windspeed'].groupby([bst.index.year,bst.index.month]).mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),monthly_windspeed)
plt.title("Windspeed")
Out[23]:
Text(0.5, 1.0, 'Windspeed')

We can see that 2015 was a special year for Boston.

We searched for an explanation on the Internet: in the winter of 2014-2015, a series of snowstorms gave Boston about 2.75 meters of snowfall, breaking a roughly 140-year record.

In [24]:
bst['temperature'] = pd.to_numeric(bst['temperature'])
monthly_temperature=bst['temperature'].groupby([bst.index.year,bst.index.month]).mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),monthly_temperature)
plt.title("Temperature")
Out[24]:
Text(0.5, 1.0, 'Temperature')

Temperature follows a regular yearly cycle.

3. TEMPERATURE REGRESSION AND RAIN PREDICTION

Correlation between weather features.

In [25]:
correlations=bst.corr()
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,4,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names=['humidity','pressure','temperature','wind_speed']
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show() 
bst.corr()
Out[25]:
humidity pressure temperature windspeed
humidity 1.000000 0.121626 -0.102634 0.041359
pressure 0.121626 1.000000 -0.232521 -0.029699
temperature -0.102634 -0.232521 1.000000 -0.118974
windspeed 0.041359 -0.029699 -0.118974 1.000000
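`DataFrame.corr()` computes pairwise Pearson correlations; values near +1 mean two features rise together, near -1 that one falls as the other rises. A toy sketch with perfectly correlated and anti-correlated columns:

```python
import pandas as pd

# Toy frame: b rises with a, c falls with a, so their correlations with a
# are +1 and -1 respectively.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [2.0, 4.0, 6.0],
                   "c": [3.0, 2.0, 1.0]})
corr = df.corr()
```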
In [26]:
# Use the cleaned dataframe bst so the histograms do not hit NaN values.
p = sns.pairplot(bst[['humidity','pressure','temperature','windspeed','description']],
                 palette="Set1")
In [27]:
bst['description'].value_counts()
Out[27]:
sky is clear                    17276
light rain                       4796
mist                             4728
overcast clouds                  4000
scattered clouds                 3429
broken clouds                    3294
few clouds                       2208
moderate rain                    1695
fog                              1317
light snow                        565
heavy intensity rain              424
snow                              263
haze                              233
light intensity drizzle           176
heavy snow                        119
proximity thunderstorm             95
drizzle                            63
very heavy rain                    38
thunderstorm                       31
thunderstorm with light rain       18
squalls                             7
thunderstorm with rain              6
thunderstorm with heavy rain        6
heavy intensity drizzle             5
dust                                5
freezing rain                       1
light rain and snow                 1
Name: description, dtype: int64

The weather descriptions are too fine-grained to predict directly.

We replace each specific weather description with a simple 0 or 1 indicating whether or not it rains.
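Instead of looping over rows, the binary label can also be built in one vectorized step with `isin`. A sketch (the "dry" set here is a partial, illustrative subset of the non-rain descriptions):

```python
import pandas as pd

# Mark everything outside a "dry" set as rain (1), the rest as 0.
desc = pd.Series(["sky is clear", "light rain", "mist", "moderate rain"])
dry = {"sky is clear", "broken clouds", "mist", "scattered clouds",
       "overcast clouds", "few clouds", "fog", "haze", "dust", "squalls"}
rain_or_not = (~desc.isin(dry)).astype(int)
```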

In [28]:
import sys
rain_forecast=bst.copy()
rain_forecast.replace({"sky is clear":0,"broken clouds":0,"mist":0,"scattered clouds":0,"overcast clouds":0,
                       "few clouds":0,'fog':0,'haze':0,'dust':0,'smoke':0,"squalls":0,"sand/dust whirls":0,
                       "sand":0,},inplace=True)
for i,j in enumerate(rain_forecast['description']):
    if (j!=0):
        rain_forecast.ix[i,'description']=1
rain_forecast['description'].value_counts()
/home/cloudera/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  
Out[28]:
0    36497
1     8302
Name: description, dtype: int64
In [29]:
rain_forecast
Out[29]:
humidity pressure temperature windspeed description
datetime
2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 0
2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 0
2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 0
2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 0
2012-10-01 17:00:00 68.0 1014.0 287.322845 3.0 0
... ... ... ... ... ...
2017-11-29 20:00:00 37.0 1017.0 288.079987 8.0 0
2017-11-29 21:00:00 74.0 1019.0 286.019989 6.0 0
2017-11-29 22:00:00 74.0 1019.0 283.940002 7.0 0
2017-11-29 23:00:00 56.0 1022.0 282.170013 2.0 0
2017-11-30 00:00:00 56.0 1023.0 280.649994 2.0 0

44799 rows × 5 columns

In [30]:
correlations=rain_forecast.corr()
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,5,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names=['humidity','pressure','temperature','wind_speed','rain_or_not']
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show() 
rain_forecast.corr()
Out[30]:
humidity pressure temperature windspeed description
humidity 1.000000 0.121626 -0.102634 0.041359 0.184845
pressure 0.121626 1.000000 -0.232521 -0.029699 -0.070772
temperature -0.102634 -0.232521 1.000000 -0.118974 -0.004075
windspeed 0.041359 -0.029699 -0.118974 1.000000 0.139975
description 0.184845 -0.070772 -0.004075 0.139975 1.000000
In [31]:
rain_forecast.rename(columns={"description":"rain_or_not"}, inplace=True)
rain_forecast 
Out[31]:
humidity pressure temperature windspeed rain_or_not
datetime
2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 0
2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 0
2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 0
2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 0
2012-10-01 17:00:00 68.0 1014.0 287.322845 3.0 0
... ... ... ... ... ...
2017-11-29 20:00:00 37.0 1017.0 288.079987 8.0 0
2017-11-29 21:00:00 74.0 1019.0 286.019989 6.0 0
2017-11-29 22:00:00 74.0 1019.0 283.940002 7.0 0
2017-11-29 23:00:00 56.0 1022.0 282.170013 2.0 0
2017-11-30 00:00:00 56.0 1023.0 280.649994 2.0 0

44799 rows × 5 columns

Resample the dataframe to get the daily mean, minimum, and maximum of each feature.

In [32]:
boston_mean = rain_forecast.resample('1D').mean() # min() or max()
boston_mean.rename(columns={"humidity":"meanhum", "pressure":"meanpres", "temperature":"meantemp","windspeed":"meanwsp"}, inplace=True)
boston_mean.drop(columns='rain_or_not',inplace=True)
boston_mean.head()
Out[32]:
meanhum meanpres meantemp meanwsp
datetime
2012-10-01 68.000000 1014.000000 287.371094 3.000000
2012-10-02 68.750000 1013.791667 289.013088 3.250000
2012-10-03 86.041667 1013.416667 289.020835 4.083333
2012-10-04 84.750000 1018.208333 290.043748 2.833333
2012-10-05 73.333333 1021.375000 289.517293 1.916667
In [33]:
boston_min = rain_forecast.resample('1D').min()
boston_min.rename(columns={"humidity":"minhum", "pressure":"minpres", "temperature":"mintemp","windspeed":"minwsp"}, inplace=True)
boston_min.drop(columns='rain_or_not',inplace=True)
boston_min.head()
Out[33]:
minhum minpres mintemp minwsp
datetime
2012-10-01 68.0 1014.0 287.170013 3.0
2012-10-02 64.0 1013.0 287.641876 3.0
2012-10-03 78.0 1012.0 286.119995 2.0
2012-10-04 69.0 1015.0 289.410004 0.0
2012-10-05 45.0 1020.0 288.559998 0.0
In [34]:
boston_max = rain_forecast.resample('1D').max()
boston_max.rename(columns={"humidity":"maxhum", "pressure":"maxpres", "temperature":"maxtemp","windspeed":"maxwsp"}, inplace=True)
boston_max.head()
Out[34]:
maxhum maxpres maxtemp maxwsp rain_or_not
datetime
2012-10-01 68.0 1014.0 287.596313 3.0 0
2012-10-02 78.0 1015.0 291.799988 5.0 0
2012-10-03 94.0 1015.0 293.709991 6.0 1
2012-10-04 94.0 1021.0 291.170013 5.0 1
2012-10-05 94.0 1023.0 290.390015 4.0 1

Merge the average, minimum and maximum data into one dataframe.

In [35]:
b=pd.merge(boston_mean,boston_min,how='left',on='datetime')
b=pd.merge(b,boston_max,how='left',on='datetime')
b
Out[35]:
meanhum meanpres meantemp meanwsp minhum minpres mintemp minwsp maxhum maxpres maxtemp maxwsp rain_or_not
datetime
2012-10-01 68.000000 1014.000000 287.371094 3.000000 68.0 1014.0 287.170013 3.0 68.0 1014.0 287.596313 3.0 0
2012-10-02 68.750000 1013.791667 289.013088 3.250000 64.0 1013.0 287.641876 3.0 78.0 1015.0 291.799988 5.0 0
2012-10-03 86.041667 1013.416667 289.020835 4.083333 78.0 1012.0 286.119995 2.0 94.0 1015.0 293.709991 6.0 1
2012-10-04 84.750000 1018.208333 290.043748 2.833333 69.0 1015.0 289.410004 0.0 94.0 1021.0 291.170013 5.0 1
2012-10-05 73.333333 1021.375000 289.517293 1.916667 45.0 1020.0 288.559998 0.0 94.0 1023.0 290.390015 4.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-11-26 54.666667 1006.541667 281.485832 4.083333 35.0 1004.0 276.910004 1.0 81.0 1011.0 283.239990 11.0 0
2017-11-27 55.958333 1012.916667 276.827499 3.333333 38.0 1010.0 274.059998 1.0 69.0 1020.0 280.670013 8.0 0
2017-11-28 57.541667 1028.000000 272.183750 2.333333 32.0 1022.0 266.980011 1.0 100.0 1032.0 277.910004 5.0 0
2017-11-29 68.666667 1018.458333 282.885835 3.375000 30.0 1014.0 275.649994 1.0 93.0 1026.0 289.160004 9.0 0
2017-11-30 56.000000 1023.000000 280.649994 2.000000 56.0 1023.0 280.649994 2.0 56.0 1023.0 280.649994 2.0 0

1887 rows × 13 columns
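Since the three daily frames share the same `datetime` index, the two merges above are equivalent to a single column-wise `pd.concat`. A minimal sketch on toy frames (the column names and values are illustrative, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for boston_mean / boston_min / boston_max, sharing one daily index
idx = pd.date_range("2012-10-01", periods=3, freq="D", name="datetime")
mean_df = pd.DataFrame({"meantemp": [287.4, 289.0, 289.0]}, index=idx)
min_df  = pd.DataFrame({"mintemp":  [287.2, 287.6, 286.1]}, index=idx)
max_df  = pd.DataFrame({"maxtemp":  [287.6, 291.8, 293.7]}, index=idx)

# axis=1 aligns on the shared index and stacks the columns side by side
b = pd.concat([mean_df, min_df, max_df], axis=1)
print(b)
```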

In [36]:
b= b.round(2)
bb=b.copy()
bb
Out[36]:
meanhum meanpres meantemp meanwsp minhum minpres mintemp minwsp maxhum maxpres maxtemp maxwsp rain_or_not
datetime
2012-10-01 68.00 1014.00 287.37 3.00 68.0 1014.0 287.17 3.0 68.0 1014.0 287.60 3.0 0
2012-10-02 68.75 1013.79 289.01 3.25 64.0 1013.0 287.64 3.0 78.0 1015.0 291.80 5.0 0
2012-10-03 86.04 1013.42 289.02 4.08 78.0 1012.0 286.12 2.0 94.0 1015.0 293.71 6.0 1
2012-10-04 84.75 1018.21 290.04 2.83 69.0 1015.0 289.41 0.0 94.0 1021.0 291.17 5.0 1
2012-10-05 73.33 1021.38 289.52 1.92 45.0 1020.0 288.56 0.0 94.0 1023.0 290.39 4.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-11-26 54.67 1006.54 281.49 4.08 35.0 1004.0 276.91 1.0 81.0 1011.0 283.24 11.0 0
2017-11-27 55.96 1012.92 276.83 3.33 38.0 1010.0 274.06 1.0 69.0 1020.0 280.67 8.0 0
2017-11-28 57.54 1028.00 272.18 2.33 32.0 1022.0 266.98 1.0 100.0 1032.0 277.91 5.0 0
2017-11-29 68.67 1018.46 282.89 3.38 30.0 1014.0 275.65 1.0 93.0 1026.0 289.16 9.0 0
2017-11-30 56.00 1023.00 280.65 2.00 56.0 1023.0 280.65 2.0 56.0 1023.0 280.65 2.0 0

1887 rows × 13 columns

We want to predict tomorrow's average temperature, so we add a column 'tomor_meantemp' holding the next day's mean temperature.

In [37]:
bb=bb.reset_index(drop=True)    
tomorrow_meantemp=bb[["meantemp"]].copy() # copy() so the edits below do not trigger SettingWithCopyWarning
tomorrow_meantemp.drop(0,inplace=True) # dropping row 0 and re-indexing shifts the series up by one day
tomorrow_meantemp.reset_index(drop=True,inplace=True)     
tomorrow_meantemp.rename(columns={"meantemp":"tomor_meantemp"}, inplace=True)
bb.drop(1886,inplace=True) # the last day has no tomorrow to predict
bb=pd.merge(bb,tomorrow_meantemp,how='left',left_index=True, right_index=True)
bb
Out[37]:
meanhum meanpres meantemp meanwsp minhum minpres mintemp minwsp maxhum maxpres maxtemp maxwsp rain_or_not tomor_meantemp
0 68.00 1014.00 287.37 3.00 68.0 1014.0 287.17 3.0 68.0 1014.0 287.60 3.0 0 289.01
1 68.75 1013.79 289.01 3.25 64.0 1013.0 287.64 3.0 78.0 1015.0 291.80 5.0 0 289.02
2 86.04 1013.42 289.02 4.08 78.0 1012.0 286.12 2.0 94.0 1015.0 293.71 6.0 1 290.04
3 84.75 1018.21 290.04 2.83 69.0 1015.0 289.41 0.0 94.0 1021.0 291.17 5.0 1 289.52
4 73.33 1021.38 289.52 1.92 45.0 1020.0 288.56 0.0 94.0 1023.0 290.39 4.0 1 291.15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1881 66.58 1008.62 279.97 3.38 28.0 1003.0 275.33 1.0 100.0 1015.0 287.80 6.0 0 281.49
1882 54.67 1006.54 281.49 4.08 35.0 1004.0 276.91 1.0 81.0 1011.0 283.24 11.0 0 276.83
1883 55.96 1012.92 276.83 3.33 38.0 1010.0 274.06 1.0 69.0 1020.0 280.67 8.0 0 272.18
1884 57.54 1028.00 272.18 2.33 32.0 1022.0 266.98 1.0 100.0 1032.0 277.91 5.0 0 282.89
1885 68.67 1018.46 282.89 3.38 30.0 1014.0 275.65 1.0 93.0 1026.0 289.16 9.0 0 280.65

1886 rows × 14 columns
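The drop/re-index/merge dance above can also be done in one step with `DataFrame.shift(-1)`, which moves each row's value up by one day. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the daily Boston aggregates
df = pd.DataFrame({"meantemp": [287.37, 289.01, 289.02, 290.04]})

# shift(-1): row i receives day i+1's mean temperature
df["tomor_meantemp"] = df["meantemp"].shift(-1)

# The last day has no tomorrow, so its shifted value is NaN; drop it
df = df.dropna().reset_index(drop=True)
print(df)
```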

In [38]:
from sklearn.metrics import explained_variance_score, \
    mean_absolute_error, \
    median_absolute_error
from sklearn.model_selection import train_test_split

features=["meanhum","minhum","maxhum","meanpres","minpres","maxpres","meanwsp","minwsp","maxwsp","meantemp","mintemp","maxtemp"]
X =bb[features]
y = bb[["tomor_meantemp"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
X
Out[38]:
meanhum minhum maxhum meanpres minpres maxpres meanwsp minwsp maxwsp meantemp mintemp maxtemp
0 68.00 68.0 68.0 1014.00 1014.0 1014.0 3.00 3.0 3.0 287.37 287.17 287.60
1 68.75 64.0 78.0 1013.79 1013.0 1015.0 3.25 3.0 5.0 289.01 287.64 291.80
2 86.04 78.0 94.0 1013.42 1012.0 1015.0 4.08 2.0 6.0 289.02 286.12 293.71
3 84.75 69.0 94.0 1018.21 1015.0 1021.0 2.83 0.0 5.0 290.04 289.41 291.17
4 73.33 45.0 94.0 1021.38 1020.0 1023.0 1.92 0.0 4.0 289.52 288.56 290.39
... ... ... ... ... ... ... ... ... ... ... ... ...
1881 66.58 28.0 100.0 1008.62 1003.0 1015.0 3.38 1.0 6.0 279.97 275.33 287.80
1882 54.67 35.0 81.0 1006.54 1004.0 1011.0 4.08 1.0 11.0 281.49 276.91 283.24
1883 55.96 38.0 69.0 1012.92 1010.0 1020.0 3.33 1.0 8.0 276.83 274.06 280.67
1884 57.54 32.0 100.0 1028.00 1022.0 1032.0 2.33 1.0 5.0 272.18 266.98 277.91
1885 68.67 30.0 93.0 1018.46 1014.0 1026.0 3.38 1.0 9.0 282.89 275.65 289.16

1886 rows × 12 columns

We use a LinearRegression model to predict tomorrow's average temperature.

In [39]:
# LinearRegression Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
features=["meanhum","minhum","maxhum","meanpres","minpres","maxpres",
          "meanwsp","minwsp","maxwsp","meantemp","mintemp","maxtemp"]
X =bb[features]
y = bb[["tomor_meantemp"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=23)
model=LinearRegression()
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))
0.9002524225315898
0.8955078114366792
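The metrics imported in In [38] (explained_variance_score, mean_absolute_error, median_absolute_error) are never actually called. A sketch of how they would be applied to a fitted regressor, on synthetic data since the notebook's frames are not reproduced here; the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score,
                             mean_absolute_error,
                             median_absolute_error)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the daily weather features and next-day temperature
rng = np.random.RandomState(23)
X = rng.rand(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=23)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The absolute-error metrics complement R^2: they are expressed in the
# target's own units (Kelvin in the notebook's case)
print("explained variance:", explained_variance_score(y_test, pred))
print("mean abs error:   ", mean_absolute_error(y_test, pred))
print("median abs error: ", median_absolute_error(y_test, pred))
```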
In [40]:
import matplotlib.pyplot as plt
plt.figure(figsize=(18,9))
plt.plot(range(len(y_test)),y_test,'b')
plt.plot(range(len(y_test)),model.predict(X_test),'r')
plt.title("Prediction test") 
plt.legend( ('test', 'prediction'))
plt.show() 
In [41]:
plt.scatter(y_test,model.predict(X_test))
Out[41]:
<matplotlib.collections.PathCollection at 0x7f6b7a19de10>

The prediction matches the test data very well.

In [42]:
bb.describe().T 
Out[42]:
count mean std min 25% 50% 75% max
meanhum 1886.0 77.298971 13.147743 35.43 68.3300 77.980 86.6125 100.00
meanpres 1886.0 1019.859761 8.840649 989.71 1013.9700 1018.810 1025.1700 1052.29
meantemp 1886.0 283.785308 9.300951 254.87 276.3250 284.210 291.7600 301.78
meanwsp 1886.0 3.381368 1.614003 0.21 2.2500 3.170 4.2500 12.92
minhum 1886.0 55.790032 20.423200 11.00 40.0000 53.000 70.0000 100.00
minpres 1886.0 1014.608165 9.657900 978.00 1009.0000 1014.000 1020.0000 1050.00
mintemp 1886.0 279.951930 9.115021 249.54 272.9350 280.255 287.8200 298.39
minwsp 1886.0 1.200424 1.325578 0.00 0.0000 1.000 2.0000 10.00
maxhum 1886.0 94.172322 8.550769 47.00 92.0000 99.000 100.0000 100.00
maxpres 1886.0 1024.846766 8.796573 993.00 1019.0000 1024.000 1031.0000 1054.00
maxtemp 1886.0 288.285064 9.784484 261.17 280.2425 288.490 296.6075 308.00
maxwsp 1886.0 6.282078 2.522886 1.00 4.0000 6.000 8.0000 20.00
rain_or_not 1886.0 0.655885 0.475205 0.00 0.0000 1.000 1.0000 1.00
tomor_meantemp 1886.0 283.781744 9.300864 254.87 276.3250 284.185 291.7600 301.78
In [43]:
rain_forecast
Out[43]:
humidity pressure temperature windspeed rain_or_not
datetime
2012-10-01 13:00:00 68.0 1014.0 287.170013 3.0 0
2012-10-01 14:00:00 68.0 1014.0 287.186096 3.0 0
2012-10-01 15:00:00 68.0 1014.0 287.231659 3.0 0
2012-10-01 16:00:00 68.0 1014.0 287.277252 3.0 0
2012-10-01 17:00:00 68.0 1014.0 287.322845 3.0 0
... ... ... ... ... ...
2017-11-29 20:00:00 37.0 1017.0 288.079987 8.0 0
2017-11-29 21:00:00 74.0 1019.0 286.019989 6.0 0
2017-11-29 22:00:00 74.0 1019.0 283.940002 7.0 0
2017-11-29 23:00:00 56.0 1022.0 282.170013 2.0 0
2017-11-30 00:00:00 56.0 1023.0 280.649994 2.0 0

44799 rows × 5 columns

In [44]:
rain_forecast['temperature'] = pd.to_numeric(rain_forecast['temperature'])
rain_forecast['humidity'] = pd.to_numeric(rain_forecast['humidity'])
rain_forecast['pressure'] = pd.to_numeric(rain_forecast['pressure'])
rain_forecast['windspeed'] = pd.to_numeric(rain_forecast['windspeed'])

We use a Perceptron model to predict whether or not it will rain.

In [45]:
from sklearn.linear_model import Perceptron
X_train=rain_forecast[["humidity","pressure","windspeed","temperature"]][0:33609].values
y_train=rain_forecast["rain_or_not"][0:33609].values
X_test=rain_forecast[["humidity","pressure","windspeed","temperature"]][33610:43609].values
y_test=rain_forecast["rain_or_not"][33610:43609].values
y_train=list(y_train)
y_test=list(y_test)

clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X_train,y_train)
Out[45]:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)
In [46]:
print('train dataset score:')
print(clf.score(X_train,y_train))
print('test dataset score:')
print(clf.score(X_test,y_test))
train dataset score:
0.7978517658960398
test dataset score:
0.841984198419842
In [47]:
predict=clf.predict(X_test)
k=int((predict!=np.array(y_test)).sum()) # number of misclassified test samples
print(k,len(y_test))
1580 9999
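The Perceptron is sensitive to feature scale, and the four inputs here live on very different ranges (pressure around 1000 hPa versus windspeed around 0-10). Standardizing the features before fitting is a common refinement; a sketch on synthetic two-feature data (this pipeline is an alternative to explore, not what the notebook ran):

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with wildly different feature scales,
# mimicking pressure (~1000 hPa) vs windspeed (~0-10)
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(1015, 8, 500), rng.uniform(0, 10, 500)])
y = (X[:, 0] - 1015 + 2 * (X[:, 1] - 5) > 0).astype(int)

# StandardScaler rescales every feature to zero mean / unit variance
# before the Perceptron sees it
clf = make_pipeline(StandardScaler(), Perceptron(tol=1e-3, random_state=0))
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```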

4. Weather feature distribution across American states

4.1 Plot the distribution of average temperature

In [48]:
import plotly.graph_objects as go
In [49]:
cursor.execute("select * from temperature")
temperature=pd.DataFrame(cursor.fetchall(),columns=['datetime','Vancouver','Portland',
                                                 'SanFrancisco','Seattle','LosAngeles','SanDiego',
                                                 'LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
                                                 'Dallas','Houston','KansasCity','Minneapolis','SaintLouis',
                                                 'Chicago','Nashville','Indianapolis','Atlanta','Detroit',
                                                 'Jacksonville','Charlotte','Miami','Pittsburgh',
                                                 'Toronto','Philadelphia','NewYork','Montreal',
                                                 'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa',
                                                 'Nahariyya','Jerusalem'])
cursor.execute("select AVG(Vancouver),AVG(Portland),AVG(SanFrancisco),AVG(Seattle),AVG(LosAngeles)\
,AVG(SanDiego),AVG(LasVegas),AVG(Phoenix),AVG(Albuquerque),AVG(Denver),AVG(SanAntonio),AVG(Dallas)\
,AVG(Houston),AVG(KansasCity),AVG(Minneapolis),AVG(SaintLouis),AVG(Chicago),AVG(Nashville)\
,AVG(Indianapolis),AVG(Atlanta),AVG(Detroit),AVG(Jacksonville),AVG(Charlotte),AVG(Miami),AVG(Pittsburgh)\
,AVG(Toronto),AVG(Philadelphia),AVG(NewYork),AVG(Montreal),AVG(Boston),AVG(Beersheba),AVG(TelAvivDistrict)\
,AVG(Eilat),AVG(Haifa),AVG(Nahariyya),AVG(Jerusalem) from temperature")
avg_temp=cursor.fetchall()
avg_temp_c=np.array(avg_temp)-273.15
In [50]:
geo_avg_temp=pd.DataFrame(columns=["avg_temp","city_name","geo_code"])
geo_avg_temp["avg_temp"]=avg_temp_c.reshape(-1,)
geo_avg_temp["city_name"]=['Vancouver','Portland','SanFrancisco','Seattle','LosAngeles',
                           'SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
                           'Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago',
                           'Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte',
                           'Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal',
                           'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']
geo_avg_temp["geo_code"]=['Vancouver','Portland','CA','WA','LA','CA','NV',
                          'AZ','NM','CO','TX','TX','TX',
                          'MO','MN','MO','IL','TN','IN',
                          'GA','MI','FL','NC','FL','PA','ON',
                          'PA','NY','Montreal','MA','Beersheba','TelAvivDistrict','Eilat',
                          'Haifa','Nahariyya','Jerusalem']
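The parallel city_name/geo_code lists above are easy to misalign; building geo_code from a dict keyed by city makes each correspondence explicit. A minimal sketch with a hypothetical subset of the mapping:

```python
import pandas as pd

# Hypothetical subset of the city-to-state mapping used above
city_to_state = {
    "SanFrancisco": "CA",
    "Seattle": "WA",
    "Phoenix": "AZ",
    "Miami": "FL",
}

df = pd.DataFrame({"city_name": ["Seattle", "Miami", "Phoenix"]})
# map() looks each city up in the dict, so ordering mistakes are impossible
df["geo_code"] = df["city_name"].map(city_to_state)
print(df)
```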
In [51]:
geo_avg_temp
Out[51]:
avg_temp city_name geo_code
0 10.712654 Vancouver Vancouver
1 11.842929 Portland Portland
2 15.005821 SanFrancisco CA
3 11.259626 Seattle WA
4 17.696116 LosAngeles LA
5 17.065044 SanDiego CA
6 19.274887 LasVegas NV
7 22.343358 Phoenix AZ
8 12.467856 Albuquerque NM
9 9.689407 Denver CO
10 20.635646 SanAntonio TX
11 19.225872 Dallas TX
12 21.054268 Houston TX
13 13.491663 KansasCity MO
14 7.550489 Minneapolis MN
15 13.525947 SaintLouis MO
16 10.200573 Chicago IL
17 15.416306 Nashville TN
18 11.621904 Indianapolis IN
19 16.622673 Atlanta GA
20 9.821668 Detroit MI
21 21.183380 Jacksonville FL
22 15.747483 Charlotte NC
23 25.031274 Miami FL
24 10.903939 Pittsburgh PA
25 8.791241 Toronto ON
26 12.224168 Philadelphia PA
27 12.250406 NewYork NY
28 7.193010 Montreal Montreal
29 10.629823 Boston MA
30 18.371987 Beersheba Beersheba
31 21.362306 TelAvivDistrict TelAvivDistrict
32 23.347275 Eilat Eilat
33 22.116398 Haifa Haifa
34 20.944803 Nahariyya Nahariyya
35 20.034253 Jerusalem Jerusalem
In [52]:
new=pd.DataFrame(geo_avg_temp.groupby("geo_code").mean())
new['avg_temp']
Out[52]:
geo_code
AZ                 22.343358
Beersheba          18.371987
CA                 16.035432
CO                  9.689407
Eilat              23.347275
FL                 23.107327
GA                 16.622673
Haifa              22.116398
IL                 10.200573
IN                 11.621904
Jerusalem          20.034253
LA                 17.696116
MA                 10.629823
MI                  9.821668
MN                  7.550489
MO                 13.508805
Montreal            7.193010
NC                 15.747483
NM                 12.467856
NV                 19.274887
NY                 12.250406
Nahariyya          20.944803
ON                  8.791241
PA                 11.564054
Portland           11.842929
TN                 15.416306
TX                 20.305262
TelAvivDistrict    21.362306
Vancouver          10.712654
WA                 11.259626
Name: avg_temp, dtype: float64
In [53]:
fig = go.Figure(data=go.Choropleth(
    locations=new.index, # Spatial coordinates
    z = new["avg_temp"], # Data to be color-coded
    colorscale = 'Reds',
    locationmode = 'USA-states', # only two-letter US state codes are drawn; other geo codes are ignored
    colorbar_title = "temperature",
))
fig.update_layout(
    title_text = 'Average temperature',
    geo_scope='usa', # limit map scope to USA
)
fig.show()

4.2 Plot the distribution of average humidity

In [54]:
cursor.execute("select * from humidity")
humidity=pd.DataFrame(cursor.fetchall(),columns=['datetime','Vancouver','Portland',
                                                 'SanFrancisco','Seattle','LosAngeles','SanDiego',
                                                 'LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
                                                 'Dallas','Houston','KansasCity','Minneapolis','SaintLouis',
                                                 'Chicago','Nashville','Indianapolis','Atlanta','Detroit',
                                                 'Jacksonville','Charlotte','Miami','Pittsburgh',
                                                 'Toronto','Philadelphia','NewYork','Montreal',
                                                 'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa',
                                                 'Nahariyya','Jerusalem'])
cursor.execute("select AVG(Vancouver),AVG(Portland),AVG(SanFrancisco),AVG(Seattle),AVG(LosAngeles)\
,AVG(SanDiego),AVG(LasVegas),AVG(Phoenix),AVG(Albuquerque),AVG(Denver),AVG(SanAntonio),AVG(Dallas)\
,AVG(Houston),AVG(KansasCity),AVG(Minneapolis),AVG(SaintLouis),AVG(Chicago),AVG(Nashville)\
,AVG(Indianapolis),AVG(Atlanta),AVG(Detroit),AVG(Jacksonville),AVG(Charlotte),AVG(Miami),AVG(Pittsburgh)\
,AVG(Toronto),AVG(Philadelphia),AVG(NewYork),AVG(Montreal),AVG(Boston),AVG(Beersheba),AVG(TelAvivDistrict)\
,AVG(Eilat),AVG(Haifa),AVG(Nahariyya),AVG(Jerusalem) from humidity")
avg_h=cursor.fetchall()
In [55]:
geo_avg_h=pd.DataFrame(columns=["avg_h","city_name","geo_code"])
geo_avg_h["avg_h"]=np.array(avg_h).reshape(-1,)
geo_avg_h["city_name"]=['Vancouver','Portland','SanFrancisco','Seattle','LosAngeles',
                           'SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
                           'Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago',
                           'Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte',
                           'Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal',
                           'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']
geo_avg_h["geo_code"]=['Vancouver','Portland','CA','WA','LA','CA','NV',
                          'AZ','NM','CO','TX','TX','TX',
                          'MO','MN','MO','IL','TN','IN',
                          'GA','MI','FL','NC','FL','PA','ON',
                          'PA','NY','Montreal','MA','Beersheba','TelAvivDistrict','Eilat',
                          'Haifa','Nahariyya','Jerusalem']
In [56]:
geo_avg_h
Out[56]:
avg_h city_name geo_code
0 81.895480 Vancouver Vancouver
1 74.697616 Portland Portland
2 76.875042 SanFrancisco CA
3 77.159038 Seattle WA
4 62.773841 LosAngeles LA
5 67.784809 SanDiego CA
6 31.937831 LasVegas NV
7 37.484424 Phoenix AZ
8 45.186157 Albuquerque NM
9 53.022557 Denver CO
10 67.700799 SanAntonio TX
11 64.323764 Dallas TX
12 74.065585 Houston TX
13 66.629736 KansasCity MO
14 71.244172 Minneapolis MN
15 70.598763 SaintLouis MO
16 74.434918 Chicago IL
17 68.240344 Nashville TN
18 72.383859 Indianapolis IN
19 70.846758 Atlanta GA
20 72.477574 Detroit MI
21 76.439481 Jacksonville FL
22 70.189190 Charlotte NC
23 75.512362 Miami FL
24 70.364289 Pittsburgh PA
25 76.361909 Toronto ON
26 68.017769 Philadelphia PA
27 66.642417 NewYork NY
28 71.861538 Montreal Montreal
29 77.375301 Boston MA
30 70.604857 Beersheba Beersheba
31 66.861509 TelAvivDistrict TelAvivDistrict
32 53.155184 Eilat Eilat
33 79.800383 Haifa Haifa
34 78.606760 Nahariyya Nahariyya
35 68.732293 Jerusalem Jerusalem
In [57]:
humidity.drop(columns="datetime").mean(axis=0)
new_h=pd.DataFrame(geo_avg_h.groupby("geo_code").mean())
new_h['avg_h']
Out[57]:
geo_code
AZ                 37.484424
Beersheba          70.604857
CA                 72.329926
CO                 53.022557
Eilat              53.155184
FL                 75.975922
GA                 70.846758
Haifa              79.800383
IL                 74.434918
IN                 72.383859
Jerusalem          68.732293
LA                 62.773841
MA                 77.375301
MI                 72.477574
MN                 71.244172
MO                 68.614249
Montreal           71.861538
NC                 70.189190
NM                 45.186157
NV                 31.937831
NY                 66.642417
Nahariyya          78.606760
ON                 76.361909
PA                 69.191029
Portland           74.697616
TN                 68.240344
TX                 68.696716
TelAvivDistrict    66.861509
Vancouver          81.895480
WA                 77.159038
Name: avg_h, dtype: float64
In [58]:
fig = go.Figure(data=go.Choropleth(
    locations=new_h.index, # Spatial coordinates
    z = new_h["avg_h"], # Data to be color-coded
    colorscale = 'Blues',
    locationmode = 'USA-states',
    colorbar_title = "humidity",
))
fig.update_layout(
    title_text = 'Average humidity',
    geo_scope='usa', # limit map scope to USA
)
fig.show()

Conclusion:

The hourly dataset contains a wealth of information. The exploration falls into three main parts.

1. We extracted the weather features from each table and joined them into a single Boston table to explore one city's weather in depth: the most frequent weather conditions, the average of each feature grouped by weather description, and the time-series curves of these features. Along the way we found some interesting results, such as the relationship between the weather descriptions and the numerical features, and the 2015 Boston snowstorm.

2. We went further on the Boston dataset with machine learning: a temperature regression and a rain prediction, both with satisfying outcomes.

3. We explored the potential of the data with the plotly library, using Hive to compute per-city averages of numerical features such as temperature and plotting their distribution on a map.